Introduction to Open Data Science - Course Project

About the project

This course covers the basics of reproducible research with R, RStudio, and GitHub. My GitHub repository can be found here: https://github.com/viljahe/IODS-project

date()
## [1] "Thu Nov 26 14:12:05 2020"

I found this course by browsing courses for doctoral students, trying to find something that would actually benefit me in some way: I’m interested in open science, and while I have been using R for some years, Git is something I’m not yet familiar with but have been wanting to learn for a while now. So this was a good opportunity to do just that.

I’m looking forward to learning how Git can make my research more open and more reproducible. I hope I can adapt things I learn from this course as a part of my workflow. Three things I’m most looking forward to/goals for myself for this course:

  1. Be more familiar and more efficient with RMarkdown, figure out how RMarkdown could benefit my workflow
  2. Get to know GitHub so that I understand its capabilities and it doesn’t seem quite so scary anymore
  3. Figure out what else I should learn in the future: what are the next steps after this course, which tools or skills would benefit me the most

Regression and model validation

This week we learned about data wrangling and linear regression. First, we created the data set used in the analyses below, including some summary variables. Then we learned how to conduct a multiple linear regression and how to assess the model diagnostics.

# load the necessary packages and set scientific notation off

options(scipen = 999)

library("GGally")
library("ggplot2")

Inspecting the data

First, let’s load the data set created for this exercise to R and inspect its dimension and structure using dim() and str():

learning2014 <- read.table("learning2014.txt")
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

The data contains information on 166 students from an introductory statistics course. The data include participants’ gender and age, a measure of global attitudes towards statistics, the points participants scored in the exam, and three variables reflecting deep, surface, and strategic learning. Each learning variable is the mean of several original items asking about participants’ learning approaches in that category, and the attitude variable is the mean of ten original items asking about different attitudes towards statistics.
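Composite scores like these can be sketched with rowMeans(); the item names and values below are invented for illustration and are not the actual survey items:

```r
# hypothetical illustration: a composite score is the row-wise mean of
# several Likert-scale items (item names and values invented)
items <- data.frame(
  item1 = c(4, 3, 5),
  item2 = c(3, 3, 4),
  item3 = c(4, 2, 5)
)
deep_score <- rowMeans(items)
round(deep_score, 2)
## 3.67 2.67 4.67
```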

We can inspect the data and the associations between variables visually using the function ggpairs() and look at the most common descriptive statistics with the help of summary() to get a more complete grasp of the data.

p <- ggpairs(learning2014, 
             mapping = aes(col = gender, alpha = 0.1), 
             lower = list(combo = wrap("facethist", bins = 20)),
             upper = list(continuous = wrap("cor", size = 2.5))) + 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"), 
        axis.text.x = element_text(size=5), 
        axis.text.y = element_text(size=5))
p

summary(learning2014)
##     gender               age           attitude         points     
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   : 7.00  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:19.00  
##  Mode  :character   Median :22.00   Median :3.200   Median :23.00  
##                     Mean   :25.51   Mean   :3.143   Mean   :22.72  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:27.75  
##                     Max.   :55.00   Max.   :5.000   Max.   :33.00  
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000

The data contains more women than men, and the participants are mostly under the age of 30. There are some differences between men and women in how the variables are distributed, most visibly in the global attitudes towards statistics, where men score somewhat higher on average and the scores are distributed more evenly for women. Comparing the learning approaches, deep learning scores skew towards the higher end, whereas strategic scores are distributed more evenly around the middle. For men, the surface learning scores seem to be distributed more evenly than for women, who have a more visible peak around the middle values of the variable.

There is a significant positive correlation between global attitudes towards statistics and the exam score (\(r = .437\)), and this holds for both gender groups. Attitudes towards statistics also seem to be slightly correlated with surface learning, with a more positive attitude being associated with a lower surface learning score (\(r = -.176\)); upon closer inspection, however, there is no significant correlation for women, and for men the association is stronger than for all participants combined (\(r = -.374\)). There is also an expected negative correlation between surface learning and deep learning, with higher scores on surface learning being associated with lower scores on deep learning (\(r = -.324\)), but surprisingly this effect is again mostly driven by a strong association for men (\(r = -.622\)), whereas for women the two learning approaches are not associated with each other.

Linear regression

The associations between variables can be further examined using linear regression. For example, we can see how exam score is explained by the variables that have the strongest correlation with it: attitudes towards statistics, the surface learning approach, and the strategic learning approach.

model1 <- lm(points ~ attitude + surf + stra, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991      0.00322 ** 
## attitude      3.3952     0.5741   5.913 0.0000000193 ***
## surf         -0.5861     0.8014  -0.731      0.46563    
## stra          0.8531     0.5416   1.575      0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 0.00000003156

Based on the results of the regression, more positive attitudes towards statistics are associated with higher scores on the exam (\(B = 3.40, p <.001\)). This means that for each one-point increase on the attitude scale, the predicted exam score increases by approximately 3.4 points. The two learning approaches included in the model do not seem to be associated with exam scores, so they can be removed from the model.
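To make the interpretation concrete, here is a small sketch computing predictions directly from the coefficient estimates reported in the model summary above (values copied from that output; surf and stra are held fixed at arbitrary values):

```r
# coefficient estimates copied from the summary of model1 above
b0 <- 11.0171; b_att <- 3.3952; b_surf <- -0.5861; b_stra <- 0.8531

# predicted exam points for a given attitude, holding surf and stra fixed at 3
pred <- function(att, surf = 3, stra = 3) {
  b0 + b_att * att + b_surf * surf + b_stra * stra
}

# a one-point increase in attitude raises the prediction by exactly the coefficient
pred(4) - pred(3)
## 3.3952
```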

# remove surf first as its p-value is the higher of the two
model2 <- update(model1, ~. -surf)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745       0.00025 ***
## attitude      3.4658     0.5652   6.132 0.00000000631 ***
## stra          0.9137     0.5345   1.709       0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 0.000000007734
# stra is still not significant, remove it as well
model3 <- update(model2, ~. -stra)
summary(model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 0.00000000195 ***
## attitude      3.5255     0.5674   6.214 0.00000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 0.000000004119

With the surface learning approach removed from the model, the p-value associated with the strategic learning approach decreases, but the effect remains non-significant at the conventional \(p < .05\) level. Removing both non-significant predictors does not greatly affect the association between attitudes towards statistics and exam scores: more positive attitudes are still associated with higher exam scores, and the magnitude of the effect remains roughly the same (\(B = 3.53, p <.001\)). The multiple \(R^2\) is a measure of the goodness-of-fit of our model, i.e. it tells how much of the variance of the dependent variable the model explains. From the \(R^2\) we can infer that the model explains approximately 19% of the variance in exam scores.
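The \(R^2\) definition can be verified by hand; a minimal sketch on simulated data (not the course data):

```r
# verify the definition of R-squared: 1 - SS_res / SS_tot
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2)    # total sum of squares
r2 <- 1 - ss_res / ss_tot

all.equal(r2, summary(fit)$r.squared)
## TRUE
```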

Model diagnostics

Let’s then examine how well our model meets the assumptions of linear regression. We can do this by inspecting diagnostic plots:

# dev.new(width=7, height=20)
par(mfrow=c(2,2))
plot(model3, c(1, 2, 5))

In the first figure the residuals are plotted against the fitted values of the model. The assumption of constant error variance is met, as the residuals are spread quite randomly. From the QQ-plot we can see that the errors are reasonably normally distributed, following the dashed line quite well. The third figure shows that no observations have very high leverage, so no single data point has an outsized influence on the model. Based on these plots, it can be concluded that there are no significant problems with the model.


Logistic regression

This week was all about joining two datasets together and learning logistic regression. In the data wrangling part we again created the data set used in the analyses below, this time by merging two datasets and creating some new variables. Then we learned how to conduct a logistic regression and assess the predictive power of the model.

# load the necessary packages and set scientific notation off

options(scipen = 999)

library("dplyr")
library("tidyr")
library("ggplot2")
library("gridExtra")
library("Hmisc")

First look at the data

This data consists of information about students’ alcohol consumption, school performance in two subjects (mathematics and Portuguese language), and some background variables relating to their social environment, family, and school. The data has been collected from two Portuguese schools. More information on the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

alc <- read.csv("alc.csv", sep=",", header=TRUE)

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures.m"
## [16] "schoolsup"  "famsup"     "paid.m"     "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences.m"
## [31] "G1.m"       "G2.m"       "G3.m"       "failures.p" "paid.p"    
## [36] "absences.p" "G1.p"       "G2.p"       "G3.p"       "id.p"      
## [41] "failures"   "paid"       "absences"   "G1"         "G2"        
## [46] "G3"         "alc_use"    "high_use"

The data has been merged from two separate datasets, one for math and the other for Portuguese language. The background variables are identical between the two original datasets, but the variables relating to school performance in the two subjects differ. These are the number of past class failures (failures), whether or not a student has had extra paid classes (paid), the number of absences (absences), and the first, second, and final grades (G1-G3, respectively). The final dataset used here therefore has three separate variables for each subject-specific original variable: the two original variables, denoted by the suffix .p (Portuguese) or .m (math), and one that is the mean of the two (without a suffix).
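The merge described above can be sketched with base R’s merge(); the toy values below are invented for illustration, and this is not the actual wrangling script:

```r
# toy stand-ins for the two subject-specific datasets (values invented)
math <- data.frame(id = 1:3, sex = c("F", "M", "F"), absences = c(2, 6, 0))
por  <- data.frame(id = 1:3, sex = c("F", "M", "F"), absences = c(4, 2, 1))

# join on the shared background variables; subject-specific columns get suffixes
joined <- merge(math, por, by = c("id", "sex"), suffixes = c(".m", ".p"))

# the combined variable is the mean of the two subject-specific ones
joined$absences <- rowMeans(joined[c("absences.m", "absences.p")])
joined$absences
## 3.0 4.0 0.5
```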

I chose to examine how weekly time spent on studies (studytime), parents’ cohabitation status (Pstatus), the number of school absences (absences), and the mean of the final grades in math and Portuguese (G3) affect whether a student’s alcohol consumption is high or low (high_use). Study time is a discrete variable with 4 categories: less than 2 hours (1), 2-5 hours (2), 5-10 hours (3), and more than 10 hours (4). Parents’ cohabitation status has two possible values: living together (T) and living apart (A). The number of school absences is a continuous variable ranging here from 0 to 45, and alcohol consumption is a categorical variable based on the mean of workday and weekend alcohol consumption (each ranging from 1, “very low”, to 5, “very high”): those with a mean higher than 2 have been categorized as high alcohol consumers.
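The high_use categorization described above amounts to a couple of lines; the values here are toy data for illustration:

```r
# toy workday (Dalc) and weekend (Walc) consumption scores, scale 1-5
Dalc <- c(1, 3, 2, 5)
Walc <- c(2, 4, 2, 5)

alc_use  <- (Dalc + Walc) / 2   # mean of the two consumption scores
high_use <- alc_use > 2         # TRUE when the mean exceeds 2
high_use
## FALSE TRUE FALSE TRUE
```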

My hypotheses are that more time spent on studying and higher grades are connected to low alcohol consumption, while higher number of absences and parents living apart are connected to high alcohol consumption.

Distributions and relationships between variables

Let’s have a closer look at the chosen variables and how they relate to each other.

# subset for only relevant variables
alc.sub <- subset(alc, select = c(studytime, absences, G3, Pstatus, high_use))

# see numerical summaries of variables
summary(alc.sub[(-4)])
##    studytime        absences            G3         high_use      
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.00   Mode :logical  
##  1st Qu.:1.000   1st Qu.: 1.000   1st Qu.:10.00   FALSE:259      
##  Median :2.000   Median : 3.000   Median :12.00   TRUE :111      
##  Mean   :2.043   Mean   : 4.511   Mean   :11.52                  
##  3rd Qu.:2.000   3rd Qu.: 6.000   3rd Qu.:14.00                  
##  Max.   :4.000   Max.   :45.000   Max.   :18.00
table(alc.sub["Pstatus"])
## 
##   A   T 
##  38 332
# barplots for the distributions

numplot <- alc.sub[c("G3", "studytime", "absences")] %>% 
  gather() %>% 
  ggplot(aes(as.numeric(value))) + 
  facet_wrap("key", scales = "free") + 
  geom_bar() + 
  xlab("value") + 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"))


discplot <- alc.sub[c("Pstatus", "high_use")] %>% 
  gather() %>% 
  ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") + 
  geom_bar() + 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"))

grid.arrange(numplot, discplot, nrow=2)

We can see that most of the students have quite a low number of absences, with a few outliers of over 40. The mean of the final grades is 11.5, and the distribution is somewhat skewed towards higher grades. Study time is mostly below 5 hours per week. There are roughly 2.3 times as many students categorized as low alcohol users as high users (259 vs. 111), and most students’ parents live together.

# box plots for alcohol use vs. continuous variables

box1 <- ggplot(alc.sub, aes(x = high_use, y = G3)) + 
  geom_boxplot() + 
  ylab("grade") +
  ggtitle("Alcohol use and final grades") +
  theme_bw()

box2 <- ggplot(alc.sub, aes(x = high_use, y = absences)) + 
  geom_boxplot() + 
  ylab("absences") +
  ggtitle("Alcohol use and absences") + 
  theme_bw()

# bar plots for alcohol use vs. discrete variables

bar1 <- alc.sub %>% ggplot(aes(x = studytime, fill = high_use)) + 
  geom_bar(position = "dodge") + 
  xlab("Study time") +
  ggtitle("Alcohol use and study time") + 
  theme_bw()

bar2 <- alc.sub %>% ggplot(aes(x = Pstatus, fill = high_use)) + geom_bar(position = "dodge") +
  xlab("Parental status") +
  ggtitle("Alcohol use and parental status") +
  theme_bw()


# draw the plots

grid.arrange(box1, box2, bar1, bar2, ncol=2)

At first glance, alcohol use does not seem connected to the mean of the final grades or the number of absences. Weekly study time seems to be a bit lower for those whose alcohol consumption is high, and those whose parents live apart might be overrepresented among high alcohol users as well, although since the number of students whose parents live apart is very low in this data, it is hard to draw any definite conclusions from these figures alone.

# correlations for numerical variables

corrs <- rcorr(as.matrix(alc.sub[(-4)]))
corrs
##           studytime absences    G3 high_use
## studytime      1.00    -0.12  0.17    -0.21
## absences      -0.12     1.00 -0.10     0.22
## G3             0.17    -0.10  1.00    -0.13
## high_use      -0.21     0.22 -0.13     1.00
## 
## n= 370 
## 
## 
## P
##           studytime absences G3     high_use
## studytime           0.0254   0.0007 0.0000  
## absences  0.0254             0.0484 0.0000  
## G3        0.0007    0.0484          0.0112  
## high_use  0.0000    0.0000   0.0112
# crosstabs for parental status and high use

table(alc.sub[c("Pstatus", "high_use")])
##        high_use
## Pstatus FALSE TRUE
##       A    26   12
##       T   233   99

The rcorr() function prints the correlations between the variables and the associated p-values. Both study time and absences are statistically significantly correlated with alcohol use, with more time spent on studies being associated with low alcohol consumption (\(r = -.21, p<.001\)) and a higher number of absences with high alcohol consumption (\(r = .22, p<.001\)), which would seem to support my hypotheses. Higher grades are associated with more time spent on studying (\(r = .17, p<.001\)), and the mean of the final grades also shows a weak negative correlation with alcohol use (\(r = -.13, p=.011\)), in line with the hypothesis.

The proportion of high alcohol users in both parental residential status groups is approximately the same, which does not support my hypothesis.
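This can be checked numerically with prop.table() on the cell counts printed in the crosstab above:

```r
# cell counts copied from the crosstab above
tab <- matrix(c(26, 233, 12, 99), nrow = 2,
              dimnames = list(Pstatus = c("A", "T"),
                              high_use = c("FALSE", "TRUE")))

# row-wise proportions: the share of high users within each parental status group
round(prop.table(tab, margin = 1), 2)
## A: 0.68 0.32, T: 0.70 0.30
```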

Logistic regression

We can then use logistic regression to see if the results support my hypotheses.

# first, specify studytime as factor as it is actually a categorical variable
alc.sub$studytime <- as.factor(alc.sub$studytime)

# specify model
model1 <- glm(high_use ~ studytime + absences + G3 + Pstatus, data = alc.sub, family = "binomial")

# print summary
modelsum1 <- summary(model1)

modelsum1
## 
## Call:
## glm(formula = high_use ~ studytime + absences + G3 + Pstatus, 
##     family = "binomial", data = alc.sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1429  -0.8596  -0.6513   1.1840   2.1990  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.22075    0.62094  -0.356 0.722211    
## studytime2  -0.46568    0.26686  -1.745 0.080987 .  
## studytime3  -1.36077    0.44121  -3.084 0.002041 ** 
## studytime4  -1.28496    0.58511  -2.196 0.028086 *  
## absences     0.07614    0.02311   3.294 0.000986 ***
## G3          -0.05500    0.03682  -1.494 0.135216    
## PstatusT     0.13698    0.40095   0.342 0.732632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 416.24  on 363  degrees of freedom
## AIC: 430.24
## 
## Number of Fisher Scoring iterations: 4
# retrieve odds ratios, confidence intervals and p-values
OR <- coef(model1) %>% exp()
CI <- confint(model1) %>% exp()
p <- coef(modelsum1)[,4]

# print table of odds ratios, confidence intervals and p-values
cbind(OR, CI, p) %>% round(3)
##                OR 2.5 % 97.5 %     p
## (Intercept) 0.802 0.233  2.684 0.722
## studytime2  0.628 0.372  1.061 0.081
## studytime3  0.256 0.102  0.585 0.002
## studytime4  0.277 0.076  0.797 0.028
## absences    1.079 1.034  1.132 0.001
## G3          0.946 0.880  1.017 0.135
## PstatusT    1.147 0.535  2.608 0.733

In the specified model, weekly study time and absences are both statistically significantly associated with alcohol use. Those who study 5-10 hours (\(OR = 0.26, CI = 0.10, 0.59, p =.002\)) or over 10 hours per week (\(OR = 0.28, CI = 0.08, 0.80, p =.028\)) are less likely to be high alcohol users than those who study less than 2 hours per week (the reference category), while those with more absences are more likely to be high alcohol users (\(OR = 1.08, CI = 1.04, 1.13, p <.001\)). The mean of the final grades and parental status are not associated with alcohol use in this model.
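As a sanity check, each odds ratio is just the exponentiated log-odds coefficient; for example, for absences (estimate copied from the model summary above):

```r
# log-odds coefficient for absences, copied from the model summary above
b_absences <- 0.07614

exp(b_absences)        # odds of high use multiplied per additional absence, ~1.08
exp(b_absences * 10)   # cumulative multiplier over ten additional absences, ~2.14
```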

The hypotheses regarding study time and absences were supported, but the results don’t support the hypotheses about final grades and parental residential status being associated with alcohol use.

Predictive power of the model

Only two of the hypothesized variables were connected to alcohol use. Let’s see how well a model which includes these two variables predicts alcohol use.

model2 <- update(model1,.~. -G3 -Pstatus)
summary(model2)
## 
## Call:
## glm(formula = high_use ~ studytime + absences, family = "binomial", 
##     data = alc.sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1907  -0.8577  -0.7137   1.1984   2.1174  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.69160    0.23897  -2.894 0.003802 ** 
## studytime2  -0.50842    0.26455  -1.922 0.054631 .  
## studytime3  -1.43763    0.43660  -3.293 0.000992 ***
## studytime4  -1.36272    0.58234  -2.340 0.019279 *  
## absences     0.07788    0.02305   3.379 0.000727 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 418.70  on 365  degrees of freedom
## AIC: 428.7
## 
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use using the reduced model
probabilities <- predict(model2, type = "response")

# add the predicted probabilities to 'alc'
alc.sub <- mutate(alc.sub, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc.sub <- mutate(alc.sub, prediction = ifelse(probability > 0.5, T , F))

# 2x2 cross table of alcohol use vs. the predictions
table(high_use = alc.sub$high_use, prediction = alc.sub$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   247   12
##    TRUE     93   18

From the cross tabulation we can see that the majority fall into the category where both predicted and actual alcohol use was low. Most of the actual high alcohol users were predicted to be low alcohol users. Altogether, 93 students were wrongly classified as low alcohol users and 12 were wrongly classified as high alcohol users. This is not very surprising, given that the category sizes in our data are very unequal, with most students being low alcohol users.
We can also examine predictions and actual alcohol use visually by drawing a scatterplot, and calculate the percentage of wrongly categorized individuals:

# explore the predictions graphically
g.pred <- ggplot(alc.sub, aes(x = probability, y = high_use, col = prediction))
g.pred + geom_point(alpha = 0.5) + geom_jitter(height=0.1) + theme_bw()

# calculate and print the total proportion of wrong classifications
prop_wrong <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  perc <- sum(n_wrong)/length(class) * 100
  print(paste(round(perc,2), "% classified wrong"))
}

prop_wrong(class = alc.sub$high_use, prob = alc.sub$probability)
## [1] "28.38 % classified wrong"

The percentage of wrongly categorized individuals based on our model is 28.38 %. Although this number is quite high, it is notably better than coin-flip guessing, which would result in approximately 50% wrong classifications. Note, however, that simply predicting every student to be a low alcohol user would misclassify only the 111 high users (30%), so the model improves on that naive baseline only slightly.
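For context, a sketch of the naive always-“low” baseline, using the class counts from the data (259 low vs. 111 high users):

```r
# class counts from the data: 259 low users, 111 high users
n_low <- 259; n_high <- 111; n <- n_low + n_high

# always predicting "low" misclassifies exactly the high users
always_low_error <- n_high / n * 100
always_low_error
## 30
```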


Clustering and classification

This week we learned about clustering and classification methods.

# load the necessary packages and set scientific notation off

options(scipen = 999)

library("MASS")
library("ggplot2")
library("dplyr")
library("GGally")

Overview of the data

In this week’s analyses we use the Boston dataset from the MASS package. The data describes housing values in the suburbs of Boston in the 1970s. More information can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

# load the data

data("Boston")

# explore the structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data consists of 506 observations of 14 variables, which include the town’s per capita crime rate, a number of variables reflecting the average socioeconomic status of the area and the demographics of the residents, general information about the location (e.g. proximity to highways and the Charles River), and air quality.

Let’s have a closer look at the data and the variables.

plot1 <- ggpairs(Boston, 
             mapping = aes(alpha = 0.1),
             lower = list(continuous = wrap("points", size = 0.3)),
             upper = list(continuous = wrap("cor", size = 2.5))) + 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"), 
        axis.text.x = element_text(size=4), 
        axis.text.y = element_text(size=5))

plot1

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Some of the variables are heavily skewed towards lower values, most notably crim (the per capita crime rate by town), zn (the proportion of residential land zoned for large lots), and chas (a dummy variable for whether the area borders the Charles River). The data therefore consists of areas that mostly have low crime rates and smaller residential lots and are not next to the river. The variables dis (distance to employment centres) and lstat (percentage of lower-status population) are also skewed towards lower values, so most areas are close to the employment centres and have a low percentage of lower-status residents. On the other hand, the variables age, ptratio, and black are skewed towards higher values, meaning the data includes more areas with older houses, a high pupil-teacher ratio, and fewer black residents. A few variables have two very clear peaks: indus (proportion of non-retail business acres), rad (accessibility to radial highways), and tax (property tax rate). The variable nox, describing the nitrogen oxides concentration, also has mostly quite low values, while medv, the median value of owner-occupied homes, peaks around the middle values.

There are a number of substantial correlations between the variables, and the scatterplots instantly reveal some interesting patterns: for example, the nitrogen oxides concentration seems to be higher when the houses in the area are older and lower when the distance to the employment centres is greater, and a higher percentage of lower-status population is associated with a lower median value of owner-occupied homes. Nothing too surprising.

Standardizing the data & creating train and test datasets

Next we can standardize the data using the function scale().

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

This takes each value, subtracts the variable mean from it, and divides the difference by the standard deviation of the variable. Each scaled value thus tells by how many standard deviations the observation differs from the variable mean. The mean of each scaled variable is now 0; values that were originally below the mean are negative, and values above the mean are positive. For example, the maximum crime rate in the original data is 88.9762, which is 85.3626764 above the mean; the standard deviation of the crime rate is 8.6015451, so the maximum value lies 85.3626764 / 8.6015451 = 9.9241096 standard deviations above the mean. This is the new maximum value in the scaled data.
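As a sanity check, the same arithmetic can be reproduced by hand. This is a small sketch assuming the Boston data from the MASS package is available:

```r
# Reproduce the scaled maximum of crim by hand:
# (value - mean) / sd, using the sample standard deviation like scale() does
library(MASS)
data(Boston)

z_max <- (max(Boston$crim) - mean(Boston$crim)) / sd(Boston$crim)
z_max                                       # ~9.9241, the maximum of the scaled variable

# scale() gives the same result
all.equal(z_max, max(scale(Boston$crim)))   # TRUE
```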

Let’s then replace the crime rate variable with a categorical variable of the crime rate using quantiles as the break points.

# change the class to dataframe
boston_scaled <- as.data.frame(boston_scaled)

# define the breaks
bins <- quantile(boston_scaled$crim)

# create the categorical variable
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove the original crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
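To see what cut() does with quantile break points, here is a toy sketch (the vector x is made up): quantile() returns the five quartile boundaries, and cut() assigns each value to one of the four resulting intervals, giving roughly equal-sized groups:

```r
# Hypothetical vector to illustrate quantile-based binning
x <- c(1, 2, 3, 4, 10, 20, 50, 100)

# the 0 %, 25 %, 50 %, 75 %, and 100 % quantiles: five break points
bins <- quantile(x)
bins

# four categories, each holding about a quarter of the observations
cut(x, breaks = bins, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
```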

Linear discriminant analysis

First, we will divide the dataset into train and test sets: we will use the train set to train our classification model and then see how well the model can predict the classes in the test set.

# choose randomly 80% of the rows in the data for the train set
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set
test <- boston_scaled[-ind,]
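One caveat: sample() draws a different split every time, so the results below will vary between knits. If reproducibility matters, fixing the RNG seed first is a common fix (the seed value 123 here is arbitrary, and the row count 506 is that of the Boston data):

```r
# Fix the random number generator so the 80/20 split is the same on every run
set.seed(123)
n <- 506                     # number of rows in the Boston data
ind <- sample(n, size = floor(n * 0.8))
length(ind)                  # 404 rows go to the train set
```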

Next, we will fit the linear discriminant analysis on the train set using the categorical crime rate variable crime as the target variable, and all the other variables as predictors.

# Fit the model (lda() comes from the MASS package)
library("MASS")
lda.crime <- lda(crime ~ ., data = train)

lda.crime
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2524752 0.2400990 0.2574257 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.98289371 -0.9048904 -0.077423115 -0.8875355  0.44606813 -0.8480603
## med_low  -0.05489573 -0.3067947 -0.002135914 -0.6118687 -0.14512558 -0.4031577
## med_high -0.37938430  0.1687756  0.214734879  0.4008474  0.03253615  0.4187731
## high     -0.48724019  1.0170690 -0.007331936  1.0733064 -0.39638855  0.7963535
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8692695 -0.6896375 -0.7531292 -0.5076793  0.38074461 -0.75932271
## med_low   0.4143469 -0.5517590 -0.5091792 -0.1384105  0.31451083 -0.12411300
## med_high -0.3658826 -0.4253975 -0.3115961 -0.3008897  0.07578919  0.04279428
## high     -0.8537986  1.6386213  1.5144083  0.7813507 -0.85054697  0.86151464
##                 medv
## low       0.54492676
## med_low   0.01860534
## med_high  0.11144004
## high     -0.69494268
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.09147076  0.69859943 -1.0425885
## indus    0.03828461 -0.08090308  0.4437025
## chas    -0.07753352 -0.05415673  0.1023825
## nox      0.37613747 -0.84622651 -1.3336012
## rm      -0.12901120 -0.05574213 -0.1490547
## age      0.19891883 -0.30826376 -0.2482416
## dis     -0.04159272 -0.22088373  0.2985786
## rad      3.29600927  1.12743424  0.0381204
## tax      0.12024734 -0.20148469  0.4709778
## ptratio  0.11564493 -0.01173114 -0.3306234
## black   -0.12303903  0.05421779  0.1376763
## lstat    0.16288020 -0.11892018  0.4530085
## medv     0.18521186 -0.29111204 -0.1273527
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9527 0.0336 0.0137
# Draw the LDA biplot

# target classes as numeric
classes <- as.numeric(train$crime)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# define the colors for the plot from green (low) to red (high)
cols <- c("green", "blue", "purple", "red")
cols1 <- cols[classes]

# plot the lda results
plot(lda.crime, dimen= 2, col = cols1, pch = classes)
lda.arrows(lda.crime, myscale = 2)

We can see from the plot that the high crime rate areas cluster together very clearly, separate from the other categories. The other categories also form clusters, although with much more overlap between them. The arrows show that the variable rad (accessibility to radial highways) contributes most to the separation of the groups: observations in the high crime rate group have clearly higher values on this variable than the other groups. The variables zn and nox seem to contribute more to the separation between the low, medium-low, and medium-high groups, with higher nitrogen oxides levels being related to higher crime rates and a larger proportion of large residential lots to lower crime rates. The same can be seen from the table of group means.

Predictions

# save crime rate categories from test data
correct_classes <- test$crime

# remove original crime rate categories
test <- dplyr::select(test, -crime)

# predict classes with the model on the test data
lda.pred <- predict(lda.crime, newdata = test)

# crosstabulate results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      11        0    0
##   med_low    2      12       10    0
##   med_high   0       7       20    2
##   high       0       0        0   23

It looks like the predictions are pretty good: in every category, the correct predictions form the majority. The “high” group was predicted most accurately, which is consistent with the previous plot, where the high values formed the most clearly separated cluster.
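The overall accuracy can be computed directly from the confusion matrix as the proportion of counts on the diagonal. Below the counts are copied from the table printed above; with a random train/test split they will differ from run to run:

```r
# Confusion matrix values copied from the cross-tabulation above
conf <- matrix(c(15, 11,  0,  0,
                  2, 12, 10,  0,
                  0,  7, 20,  2,
                  0,  0,  0, 23),
               nrow = 4, byrow = TRUE)

# correct predictions sit on the diagonal
sum(diag(conf)) / sum(conf)   # ~0.69, i.e. about 69 % of test areas classified correctly
```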

K-means clustering

Let’s reload the Boston data, standardize it again, inspect the distances, and run the k-means algorithm on this data.

# reload the Boston data
data("Boston")

# standardize the dataset
boston_scaled <- scale(Boston)

# calculate distances between observations
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
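As a quick reminder of what dist() returns — the pairwise Euclidean distances between the rows of the matrix — here is a minimal sketch:

```r
# Two points: the origin and (3, 4); their Euclidean distance is 5
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
dist(m)   # 5 (the 3-4-5 right triangle)
```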
# run the k-means algorithm
km <- kmeans(boston_scaled, centers = 4)
km
## K-means clustering with 4 clusters of sizes 144, 188, 34, 140
## 
## Cluster means:
##         crim         zn      indus       chas        nox         rm        age
## 1 -0.4027600  1.0605053 -0.9558406 -0.2449881 -0.8845708  0.8024032 -1.0064759
## 2 -0.3722714 -0.4023982 -0.1279278 -0.2723291 -0.1953028 -0.2837133  0.1233787
## 3 -0.1985497 -0.2602436  0.2799956  3.6647712  0.3830784  0.2756681  0.3721322
## 4  0.9623940 -0.4872402  1.0869401 -0.2723291  1.0790746 -0.5112906  0.7791774
##           dis          rad        tax    ptratio      black        lstat
## 1  0.96581780 -0.589478299 -0.6726655 -0.6925272  0.3580834 -0.899561938
## 2 -0.04204253 -0.594569021 -0.5239892  0.1193084  0.2658321 -0.005441924
## 3 -0.40333817  0.001081444 -0.0975633 -0.3924585  0.1715427 -0.164352523
## 4 -0.83900192  1.404479156  1.4192210  0.6474109 -0.7669493  0.972485618
##         medv
## 1  0.8660455
## 2 -0.1834340
## 3  0.5733409
## 4 -0.7837039
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   2   1   1   1   1   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   2   2   2   2   2   2   1   1   1   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   2   1   1   1   1   1   1   2   1   1   2   1   1   2   2   2   2   2   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   2   2   2   2   2   1   2   2   2   1   2   2   2   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   4   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   4   3   4   4   4   4   4   4   4   2   2   3   4   3   3   4   2   2   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   3   1   3   3   2   2   1   2   2   2   2   2   2   2   2   1   2   2   2   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   2   1   2   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   1   1   1   1   1   2   2   2   3   3   3   3   3   2   2   2   3   2   3   3 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   3   3   2   1   1   1   2   1   1   2   1   1   1   3   2   3   1   1   1 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   1   1   1   1   2   2   1   2   1   1   1   1   1   1   1   1   1   1   1   2 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   1   1   1   1   1   2   2   1   1   3   2   1   2   3   3   1   3   3   1   1 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   1   3   1   1   1   1   1   1   1   1   1   1   1   2   1   1   2   1   1 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2   2   2 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   2   2   2   2   2   1   2   2   2   1   1   1   1   2   2   2   2   2   2   2 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   2   1   2   1   1   2   2   1   1   1   1   1   1   1   1   1   3   3   3   4 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   4   4   4   3   3   4   4   4   4   3   3   4   3   4   4   4   4   4   4   4 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   2   2   2   2   2   2   2 
## 501 502 503 504 505 506 
##   2   2   2   2   2   2 
## 
## Within cluster sum of squares by cluster:
## [1] 1016.7319  844.8909  340.7321 1251.0784
##  (between_SS / total_SS =  51.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The k-means algorithm needs the number of clusters to be specified in advance. I chose 4 clusters arbitrarily, but there is a better way to determine the appropriate number of clusters: inspecting the total within-cluster sum of squares (TWCSS).

set.seed(42)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot shows the TWCSS on the y-axis against the number of clusters on the x-axis. Here I chose to plot the TWCSS for 1-10 clusters (as it is very unlikely that a larger number of clusters would give a better solution). The rule of thumb for choosing the appropriate number of clusters is to find an “elbow” in the line, i.e. the point where a sharp drop in the TWCSS levels off. In this case, 2 seems to be a good number of clusters.
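One caveat: kmeans() itself starts from random centers, so the TWCSS curve can wobble between runs. A more stable curve is obtained by running several random starts per k and keeping the best solution; nstart is a standard kmeans() argument for this. A sketch, again assuming the Boston data from the MASS package:

```r
library(MASS)
boston_scaled <- scale(Boston)

set.seed(42)
# best-of-20 k-means for each candidate number of clusters
twcss <- sapply(1:10, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 20)$tot.withinss
})
twcss   # decreases smoothly, making the elbow easier to spot
```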

Let’s run the k-means algorithm with two clusters.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
km
## K-means clustering with 2 clusters of sizes 329, 177
## 
## Cluster means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3894158  0.2621323 -0.6146857  0.002908943 -0.5823397  0.2446705
## 2  0.7238295 -0.4872402  1.1425514 -0.005407018  1.0824279 -0.4547830
##          age        dis        rad        tax    ptratio      black      lstat
## 1 -0.4331555  0.4540421 -0.5828749 -0.6291043 -0.2943707  0.3282754 -0.4530491
## 2  0.8051309 -0.8439539  1.0834228  1.1693521  0.5471636 -0.6101842  0.8421083
##         medv
## 1  0.3532917
## 2 -0.6566834
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   1   2   2   2   1   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1   1   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   1   1   2   1   2   1   1   2   2   1   1   1   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   1   1   1   1   1   1   1 
## 501 502 503 504 505 506 
##   1   1   1   1   1   1 
## 
## Within cluster sum of squares by cluster:
## [1] 2686.045 1890.637
##  (between_SS / total_SS =  35.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# plot the Boston dataset with clusters

boston_scaled <- as.data.frame(boston_scaled)

plot2 <- ggpairs(boston_scaled, 
             mapping = aes(col = as.factor(km$cluster), alpha = 0.1),
             lower = list(continuous = wrap("points", size = 0.1)),
             upper = "blank",
 switch = "both")+ 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"), 
        axis.text.x = element_text(size=4), 
        axis.text.y = element_text(size=5))

plot2

Looking at the cluster means printed above, cluster 2 is characterized by higher crime rates, a distinctly lower proportion of large residential lots, higher nitrogen oxides concentration, and older buildings. It also contains areas with shorter distances to employment centres, better accessibility to radial highways, higher property tax rates, higher pupil-teacher ratios, and a higher percentage of lower-status population. Cluster 1, in contrast, has low crime rates, a high proportion of large residential lots, less non-retail business land, and lower nitrogen oxides concentration, as well as lower accessibility to highways and lower values of the lstat variable. In short, our two clusters seem to describe two types of areas: one with generally more desirable attributes for a residential area and quite likely residents of higher socioeconomic status, and the other with less desirable attributes and residents of lower socioeconomic status. If I had to guess, I’d say these areas probably differ in their proximity to the city proper, with cluster 2 describing more urban areas and cluster 1 more suburban areas. (Note that k-means cluster labels are arbitrary and may be numbered differently on another run.)

***

Dimensionality reduction

This week we learned about dimensionality reduction techniques, such as principal component analysis and multiple correspondence analysis.

# load the necessary packages and set scientific notation off

options(scipen = 999)

library("dplyr")
library("GGally")
library("tidyr")

Overview of the data

In this week’s analyses, we use a data set that combines the Human Development Index and Gender Inequality Index data sets. More information can be found here: http://hdr.undp.org/en/content/human-development-index-hdi

# load the data
human <- read.csv("~/Desktop/kurssit-syksy-2020/IODS-project/data/human.csv", row.names = 1)

# summaries
summary(human)
##    edu.ratio        lab.ratio         life.exp        edu.exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       gni              mmr           birthrate        parliament   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The data consists of 155 countries and 8 variables that reflect human development and gender inequality:

  • edu.ratio is the ratio of the percentage of women to the percentage of men with at least some secondary education
  • lab.ratio is the ratio of the percentage of women to the percentage of men aged 15 and over in the labour force
  • life.exp is life expectancy at birth in years
  • edu.exp is expected years of schooling
  • gni is the gross national income (GNI) per capita
  • mmr is maternal mortality ratio
  • birthrate is adolescent birth rate
  • parliament is the percentage of seats in parliament held by women

Let’s have a closer look at the distributions of, and relationships between, the variables.

# graphical overview
plot1 <- ggpairs(human, 
             mapping = aes(alpha = 0.1),
             lower = list(continuous = wrap("points", size = 0.5)),
             upper = list(continuous = wrap("cor", size = 3)),
             ) + 
  theme_bw() + 
  theme(strip.background = element_rect(fill = "white"), 
        axis.text.x = element_text(size=4), 
        axis.text.y = element_text(size=5))

plot1

We can see that in most countries the ratio of women to men with some secondary education is close to 1, which means that roughly equal proportions of women and men have secondary education. The labour force ratio, on the other hand, peaks a bit below 1, so a larger proportion of men than women participate in the labour force. Life expectancy is most commonly around 70-80 years, with an understandable left skew as the human lifespan is limited. The expected years of schooling seems quite normally distributed, with the most common values around 10-15 years. The GNI has a heavy right skew, with only a couple of countries reaching a GNI of over 50 000 per capita. The maternal mortality ratio and adolescent birth rate are also right-skewed, with low values the most prevalent. Finally, the percentage of parliamentary seats held by women is well below 50 % in most countries.

The secondary education gender ratio seems to correlate with all other variables except the labour force participation gender ratio and the proportion of women in parliament. Countries with a higher proportion of women with secondary education tend to have higher life expectancy, expected education, and GNI, as well as lower maternal mortality and adolescent birth rate. A higher proportion of women participating in the labour force is associated with higher maternal mortality (which seems counterintuitive and may reflect confounding by the overall level of development) and a higher proportion of women in parliament. Higher life expectancy is associated with higher expected education and GNI, and with lower maternal mortality and adolescent birth rate. Countries with higher GNI tend to have lower maternal mortality and adolescent birth rate, and high maternal mortality is associated with a high adolescent birth rate.

Principal component analysis

Let’s run a principal component analysis on the human development data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
summary_human <- summary(pca_human)
summary_human
## Importance of components:
##                               PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     18544.1639 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance     0.9999   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion      0.9999   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# extract the variances captured for the plot
pca_pr1 <-round(100 * summary_human$importance[2, ], digits = 1)
pcalab1 <- paste0(names(pca_pr1), "(",pca_pr1, "%)")

# plot
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.6), col = c("grey80", "black"), xlab = pcalab1[1], ylab = pcalab1[2])
Figure 1. The first two components of the PCA on unstandardized data. The 1st component explains nearly 100 % of the variability and is largely determined by the GNI: countries with higher GNI place on the left side of the plot.



Well, this seems messy. The first component captures 99.99 % of the variance, and the GNI seems to be the only variable with any real effect. This is because PCA is sensitive to the scaling of the variables: features with large variance are treated as more important than those with small variance. And the GNI, which is on a completely different scale than the rest of the variables, with values ranging from hundreds to hundreds of thousands, most definitely has the largest variance in our data. It is better to standardize the data so that the scales are comparable.
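The effect of scale is easy to demonstrate on made-up data: with one column of huge variance, unscaled PCA puts essentially all of the variance on the first component, while standardizing evens things out:

```r
set.seed(1)
# two independent columns; 'big' has a thousand-fold larger standard deviation
toy <- data.frame(small = rnorm(100, sd = 1),
                  big   = rnorm(100, sd = 1000))

# unscaled: PC1 captures practically everything
summary(prcomp(toy))$importance["Proportion of Variance", ]

# standardized: the variance splits roughly evenly between the components
summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", ]
```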

# Standardize the dataset
human_std <- scale(human)

# perform pca to the standardized data
pca_humanstd <- prcomp(human_std)
summary_humanstd <- summary(pca_humanstd)
summary_humanstd
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# extract the variances captured for the plot
pca_pr2 <-round(100 * summary_humanstd$importance[2, ], digits = 1)
pcalab2 <- paste0(names(pca_pr2), "(",pca_pr2, "%)")

# plot
biplot(pca_humanstd, choices = 1:2, cex = c(0.5, 0.6), col = c("grey80", "black"), xlab = pcalab2[1], ylab = pcalab2[2])
Figure 2. The first two components of the PCA on standardized data. The 1st component explains 53.6 % of the variability and is determined by the GNI, expected years of schooling, life expectancy, the ratio of the percentages of women to men with some secondary education, maternal mortality, and adolescent birthrate. The second component explains 16.2 % of the variability and is determined by the percentage of parliamentary seats held by women and the ratio of the percentages of women to men in the labour force.



This looks better. Now that the variables are on comparable scales, the first component explains 53.61 % of the variability, the second 16.24 %, and each of the remaining six components explains less than 10 %.

The first principal component is defined by two sets of variables: expected education, life expectancy, the secondary education ratio, and the GNI on one hand, and maternal mortality and adolescent birthrate on the other. Countries that place higher on the first principal component have lower life expectancy, expected education, and GNI, fewer women with secondary education relative to men, and also high maternal mortality and adolescent birthrate. The second principal component reflects the labour force participation gender ratio and the proportion of women in parliament: the lower a country is placed on the second component, the lower its proportion of women in parliament and in the labour force. So roughly speaking, in the upper left corner we have countries that do well on the human development and gender equality indicators, and in the bottom right we have countries that do not do so well in terms of human development or gender equality.

The two components could perhaps be labeled well-being and development (1st component) and equal gender representation (2nd component). Although the gender ratio of secondary education could be thought of as an indicator of gender equality, it clearly contributes to the 1st component together with the more “general” indicators of development and well-being. This is somewhat surprising, but perhaps gender equality in secondary education tells us more about the availability of education in general, while gender representation in parliament and the labour force is a better measure of gender equality. If we look at the countries that place to the left on the first component but not very high on the second, we find countries such as Japan and Korea: developed countries where traditional gender roles still largely hold.

Multiple correspondence analysis

Multiple correspondence analysis is a dimensionality reduction technique that can be used when data consists of categorical variables. We can try this out with the tea data set from the FactoMineR package.

# load the FactoMineR package
require(FactoMineR)
## Loading required package: FactoMineR
# load the dataset ´tea´
data(tea)

# structure and dimensions of the data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The data contains 300 observations and 36 variables that reflect tea-related habits and some background information. Unfortunately there is very little documentation available about the exact nature of these variables, or at least I couldn’t find any. In any case, we can examine the categorical variables (all but age) more closely with barplots.

# visualize the data; gather() and the pipe are provided by tidyr
library(tidyr)

gather(tea[,1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))

gather(tea[,c(13:18, 20:25)]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))

gather(tea[,26:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))


As the data has so many variables, we can choose just a few to run the MCA with. I decided to go with variables that I could actually interpret in the absence of proper data documentation, and that seemed to make at least some sense to examine together: Tea, reflecting whether the person prefers black, Earl Grey, or green tea; the confusingly named How and how, reflecting whether the person takes their tea plain or with lemon, milk, or something else, and whether they use tea bags, loose leaf, or both (respectively); sugar, which tells us whether or not the person drinks their tea with sugar; and where, reflecting whether the person buys their tea from a chain store, a tea shop, or both.

# As the data contains a lot of variables, let's choose only some for easier interpretations

new.tea <- subset(tea, select = c("Tea", "How", "sugar", "how", "where"))

# MCA
mca <- new.tea %>%  MCA(graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = ., graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.335   0.309   0.257   0.224   0.198   0.184   0.172
## % of var.             16.762  15.441  12.866  11.181   9.924   9.188   8.625
## Cumulative % of var.  16.762  32.204  45.069  56.250  66.174  75.362  83.987
##                        Dim.8   Dim.9  Dim.10
## Variance               0.141   0.105   0.075
## % of var.              7.031   5.249   3.734
## Cumulative % of var.  91.017  96.266 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         | -0.325  0.105  0.088 | -0.288  0.090  0.069 |  0.305  0.120  0.078
## 2         | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789  0.806  0.343
## 3         | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217  0.061  0.070
## 4         | -0.580  0.334  0.481 | -0.328  0.116  0.154 | -0.290  0.109  0.120
## 5         | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217  0.061  0.070
## 6         | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217  0.061  0.070
## 7         | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217  0.061  0.070
## 8         | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789  0.806  0.343
## 9         |  0.155  0.024  0.012 |  0.974  1.025  0.461 | -0.005  0.000  0.000
## 10        |  0.521  0.270  0.142 |  0.845  0.770  0.373 |  0.612  0.485  0.196
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## black     |   0.473   3.294   0.073   4.681 |   0.198   0.627   0.013   1.961 |
## Earl Grey |  -0.265   2.689   0.126  -6.147 |   0.088   0.319   0.014   2.033 |
## green     |   0.487   1.558   0.029   2.962 |  -0.956   6.515   0.113  -5.813 |
## alone     |  -0.017   0.011   0.001  -0.405 |  -0.251   2.643   0.117  -5.905 |
## lemon     |   0.668   2.926   0.055   4.059 |   0.458   1.497   0.026   2.787 |
## milk      |  -0.338   1.428   0.030  -3.010 |   0.220   0.659   0.013   1.963 |
## other     |   0.287   0.148   0.003   0.873 |   2.207   9.465   0.151   6.712 |
## No.sugar  |   0.247   1.879   0.065   4.414 |   0.061   0.123   0.004   1.084 |
## sugar     |  -0.264   2.009   0.065  -4.414 |  -0.065   0.132   0.004  -1.084 |
## tea bag   |  -0.607  12.466   0.482 -12.008 |  -0.336   4.134   0.147  -6.637 |
##             Dim.3     ctr    cos2  v.test  
## black       1.045  20.945   0.358  10.342 |
## Earl Grey  -0.462  10.689   0.386 -10.737 |
## green       0.360   1.110   0.016   2.190 |
## alone       0.150   1.136   0.042   3.534 |
## lemon      -1.561  20.842   0.301  -9.491 |
## milk        0.093   0.141   0.002   0.828 |
## other       1.826   7.773   0.103   5.552 |
## No.sugar    0.621  15.481   0.412  11.100 |
## sugar      -0.664  16.548   0.412 -11.100 |
## tea bag     0.069   0.211   0.006   1.367 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.126 0.115 0.421 |
## How       | 0.076 0.220 0.385 |
## sugar     | 0.065 0.004 0.412 |
## how       | 0.708 0.503 0.008 |
## where     | 0.701 0.702 0.061 |
# plot the individuals
plot(mca, invisible=c("var"))

# plot the variable categories
plot(mca, invisible=c("ind"), habillage = "quali")


From the first plot we can see how the individuals place on the first two dimensions. There might be some clustering visible: perhaps one group located near the middle on both dimensions, a second that is higher on the second dimension, and a third that is high on the first dimension but low on the second.
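One way to probe this apparent grouping (an exploratory add-on, not part of the MCA itself) is to run k-means on the individuals' coordinates on the first two dimensions; the choice of three clusters here simply mirrors the three tentative groups described above:

```r
# exploratory sketch: k-means clustering of individuals on the first two MCA dimensions
set.seed(123)                      # make the clustering reproducible
coords <- mca$ind$coord[, 1:2]     # individual coordinates from FactoMineR's MCA object
km <- kmeans(coords, centers = 3)

# colour the individuals by their assigned cluster
plot(coords, col = km$cluster, pch = 19,
     xlab = "Dim 1", ylab = "Dim 2")
```

If the colours separate into the regions guessed at above, the visual impression of three groups gains some support; if not, the "clusters" may just be overplotting.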

From the second plot we can see how the variables contribute to the dimensions and relate to each other. The first two dimensions account for 16.76 % and 15.44 % of the variance. We can see that buying tea from both a chain store and a tea shop is associated with using both bags and loose leaf, just as buying from a tea shop only is associated with loose leaf only. These also separate individuals on the second dimension, as the two groups sit far apart on it. Those who use tea bags from a chain store are far from the loose-leaf-from-a-tea-shop people on the first dimension. The rest of the variables cluster together pretty closely, except for green tea and the “other” additions to tea (I still have little clue what this category could contain), which both seem to be linked to the second dimension. If I’m interpreting this correctly, we might see some separation between the fancy tea drinkers who like loose leaf from tea shops, the “anything goes” drinkers who mix bags and loose leaf from chain stores and tea shops and enjoy their tea with some mystery condiments, and the regular everyday tea drinkers who prefer bags from chain stores and the more common condiments.
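The variance percentages quoted above can also be read directly from the eigenvalue table stored in the MCA result, without scanning the full summary output:

```r
# eigenvalues, percentage of variance, and cumulative percentage per dimension
# (mca is the MCA() result computed above)
head(mca$eig, 3)
```

The second column of `mca$eig` gives the percentage of variance per dimension, matching the "% of var." row in the summary.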


(more chapters to be added similarly as we proceed with the course!)